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Abstract 


This article presents a multilevel longitudinal nested logit model for analyzing correct response 
and error types in multilevel longitudinal intervention data collected under a pretest—posttest, 
cluster randomized trial design. The use of the model is illustrated with a real data analysis, 
including a model comparison study regarding model complexity and cluster bias. Two substan- 
tive research questions regarding the intervention effect on correct response probability and 
error patterns are investigated using the proposed model. The recovery of item parameters for 
the proposed model using two sample size conditions is examined via a simulation study. The 
accuracy of the parameter estimates is comparable with those found in previous studies for the 
same family of models, except for the intercept parameters of correct responses. Finally, the 
impact of ignoring cluster membership in the model on the parameter estimation is also studied 
by fitting a single-level model to multilevel data. Ignoring cluster membership in the model 
adversely affects the estimation of intercept parameters in correct and error responses. 
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Cluster randomized trials are experiments in which clusters of individuals, rather than indepen- 
dent individuals, are randomly allocated to intervention groups (Raudenbush, 1997). Due to the 
nature of the randomized clusters, this design forms a multilevel structure and thus possibly 
introduces dependence among individuals sampled within the same cluster. Therefore, when 
analyzing differences in outcomes between treatment and control groups, researchers typically 
use hierarchical linear modeling (HLM) to model this dependence. 

Among others, a pretest—posttest, cluster randomized trial design is one of the most common 
designs used in practice, and this design produces multilevel longitudinal data. Reflecting the 
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popularity of item response theory (IRT) model applications in educational and psychological 
testing programs, several researchers have extended IRT models for dealing with multilevel 
structures, called multilevel IRT models (e.g., Fox & Glas, 2001; Kamata, 2001; Mislevy & 
Bock, 1989; Rabe-Hesketh, Skrondal, & Pickles, 2004; Raudenbush & Bryk, 2002), and a mar- 
ginal maximum likelihood (MML) estimation and a Bayesian estimation have been used to esti- 
mate model parameters (see Fox & Glas, 2016, for a detailed review of multilevel IRT models 
and parameter estimation).' The multilevel IRT models enable researchers to study the impact 
of different predictors and response patterns in each level of multilevel data (e.g., Fox & Glas, 
2001). When the multilevel structure is ignored, aggregation bias (also known as the ecological 
fallacy) may occur, and bias in estimated measurement precision would be expected (e.g., 
Raudenbush & Bryk, 2002). 

In addition, IRT models have been applied to longitudinal categorical data (e.g., Andersen, 
1985; Cai, 2010; Embretson, 1991). The main purposes of these models are to model latent 
variable(s) at each time point or to model changes between time points. Recently, multilevel 
extensions of longitudinal IRT models—specifically, multilevel longitudinal IRT models— 
were presented by B. Muthén and Asparouhov (2016) and von Davier, Xu, and Carstensen 
(2011). These multilevel longitudinal IRT models are intended to accurately measure individual 
differences at each time point (B. Muthén & Asparouhov, 2016) or individual differences in 
changes over time and to detect group differences (von Davier et al., 2011), by accounting for 
the multilevel data structure. Acquiring more accurate, individual differences in changes over 
time ensures better assessment of the intervention effect, which is often the major interest in a 
cluster randomized trial design. 

When categorical outcomes, such as item responses for multiple-choice items and 
constructed-response items, are collected to measure an intervention effect, either dichotomous 
or polytomous IRT models can be fitted to the data, depending on the scoring scheme of the 
responses. Although binary scoring—that is, modeling correct responses using dichotomous 
IRT models—is a more common practice, modeling incorrect responses (e.g., distractors in 
multiple-choice items and error types in constructed items) in addition to the correct responses 
can be desirable to improve measurement accuracy. Research on IRT has highlighted the poten- 
tial value of modeling polytomous responses, as opposed to binary responses, for purposes such 
as estimating the ability parameter (e.g., Baker & Kim, 2004), recovering linking coefficients 
(e.g., Kim, 2006), and examining differential distractor functioning (e.g., Suh & Bolt, 2011). 

Analyzing incorrect responses (e.g., error types) has become an active research area, particu- 
larly in mathematics education (e.g., Charalambous & Pitta-Pantazi, 2007; Clarke & Roche, 
2009), as this analysis can (a) identify the patterns of errors or mistakes that students make in 
their work, (b) explain why students make the errors, and (c) provide targeted instructions to 
correct these errors. Especially when clusters are the unit of randomization, as typically is the 
case in data sets from cluster randomized trials, treatment assessment can be expensive. 
Therefore, it is critical to extract additional information from item response data to improve 
assessment efficiency. 

Despite the practical importance of using the information from incorrect responses, the exist- 
ing extensions of multilevel longitudinal IRT models (e.g., B. Muthén & Asparouhov, 2016) 
are limited to modeling correct responses. In addition, current practices for analyzing incorrect 
responses or error types in multilevel longitudinal data rely on non-IRT-based analyses, using, 
for instance, a frequency table for error types or adopting HLM for error analysis (e.g., Bottge, 
Ma, Gassaway, Butler, & Toland, 2014). HLM analyses are often conducted with sum scores 
(or total scores). Based on previous research, a major concern of using total scores in HLM is 
the measurement error in the sum scores. Measurement error in the sum scores is responsible 
for biased parameter estimates and loss of power for detecting relationships among variables 
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(e.g., Fox & Glas, 2003). One method for alleviating this concern is using latent variable mod- 
els (e.g., Cole & Preacher, 2014), such as IRT models. 

Before analyzing multilevel longitudinal data to make inferences about the intervention 
effect, whether the construct of interest is the same across all levels (1.e., individual and clus- 
ter levels) must be checked. This can be examined by testing whether the item discrimina- 
tions are equal across all levels. If the item discriminations are equal, then the model 
formulation with the same discrimination across all levels assumes one factor, one latent 
variable. In this case, the latent variable at the cluster level can be interpreted as the cluster 
mean of the latent variable at the individual level. If the item discriminations are not equal, 
then the model with different item discriminations over the respective individual and cluster 
levels cannot assume a common latent variable, and consequently, the individual- and 
cluster-level latent variables cannot be interpreted in the same way across all levels (B. 
Muthén, 1990). If this occurs, it is often referred to as cluster bias (Jak, Oort, & Dolan, 
2013). For example, when test data come from students nested within teachers to measure 
teacher effectiveness, cluster bias means that the factors may be different over levels, such 
as students’ ability at the student level and teaching strategy at the teacher level (as the char- 
acteristics of classrooms). A simulation study showed that the item discrimination estimates 
and standard errors of a (cross-sectional) multilevel IRT model were biased when cluster 
bias was ignored (Lee & Cho, 2015). The biased item discrimination estimates can result in 
the misinterpretation of latent variables and biased IRT scores. 

The purpose of the present study is twofold: to specify an IRT model for analyzing the 
correct response and error types in multilevel longitudinal intervention data collected under 
a pretest—posttest, cluster randomized trial design and to illustrate the use of the specified 
model through a real data analysis. In the real data analysis, a series of analyses are con- 
ducted using the model. First, a model comparison analysis for choosing the best-fitting 
model is described among candidate models: that is, to determine whether the multilevel 
longitudinal model is preferred over the longitudinal model and whether a measurement 
model with cluster bias is preferred over a model without cluster bias. Second, measurement 
invariance across groups and time points is briefly examined as a preliminary analysis before 
inferences are made about the intervention effects. Then, the authors illustrate how the 
model can be used to detect the intervention effects by examining the changes in the prob- 
abilities of correct response and each error type after the intervention. Two substantive 
research questions are investigated in an instructional intervention study using the specified 
model. Finally, the recovery of item parameters for the proposed model and the impact of 
ignoring cluster membership in the model on the parameter recovery are examined via a 
simulation study. 


Multilevel Longitudinal Nested Logit Model 


When item responses are nominally scored, perhaps the most common IRT approach is 
reflected by Bock’s (1972) nominal response model (NRM), which is a divide-by-total 
model (Thissen & Steinberg, 1986). As an alternative to the NRM, Suh and Bolt (2010) pro- 
posed two-parameter logistic (2PL) nested logit model (NLM) and three-parameter logistic 
(3PL) NLM, referred to as 2PL-NLM and 3PL-NLM, respectively. Unlike existing models 
(the divide-by-total models), the models possess the attractive statistical property of cate- 
gory collapsibility regarding incorrect responses. This property allows direct evaluation of 
the probabilities of incorrect responses, independent of the correct response probability. 
That is, the NLMs use a traditional functional form for the correct response, such as 2PL 
probability, while investigating the functions of incorrect responses in contexts in which 
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practitioners might ultimately intend to use binary item scoring. This property also provides 
advantages for examining measurement invariance issues in correct responses and incorrect 
responses separately (e.g., Suh & Bolt, 2011). In this study, the 2PL-NLM is adopted as a 
measurement model for analyzing multilevel longitudinal data. 

Suppose a test is composed of / items and each item has one correct response category and 
M incorrect response categories. Let y,; represent an item response for item 7 by examinee j once 
keyed for correctness (i.e., yj; = 1 if correct, and 0 otherwise). Furthermore, let d;,, denote an 
item response for incorrect response category v (v= 1,2, ...,M) such that dj, = 1 when exami- 
nee j shows incorrect category v of item 7, and 0 otherwise. Under the 2PL-NLM (Suh & Bolt, 
2010), the probability that an examinee who has ability 0; chooses the correct response category 
for item 7 is modeled as a traditional 2PL model: 


_ exp(B; + «6;) 
1+ exp(B; +.0,6;) ° 


P(yy= 1)0)) (1) 


where 8, denotes the intercept parameter, and a; is the slope parameter for item i. The probabil- 
ity that the examinee shows incorrect response category v is modeled as the product of the prob- 
ability of getting the item wrong and the probability of selecting incorrect category v: 


P(yi = 0, diy = 1|0;) = P(vj =0|0;) P (div = 1 yy = 0, 0) 


= [ exp(B; + 0:6;) | exp (iy + hivO) (2) 
1+ exp(; +0;0;) 3 exp (Liz + dix ;) 
k=1 


where Cs and Xs are the intercept and slope parameters for the incorrect responses, respectively. 
The second bracketed term in Equation 2 has the same form as Bock’s (1972) NRM; but the 
difference is that the denominator is calculated only in terms of incorrect responses. Following 
Bock (1972), an arbitrary linear restriction is imposed for the incorrect response parameters as 
vei G+ Anv9,) =O, implying that 17)" Ci, =0 and YL, hiv =0. 

Taking into account a multilevel longitudinal structure, a multilevel longitudinal extension of 
the 2PL-NLM (hereafter, briefly called a multilevel longitudinal nested logit model [MLNLM]) 
is presented as follows. Assuming a two-level data structure in this illustration (e.g., students 
nested within teachers), the correct response probability is given as follows: 


exp(Bir + 04 1Djor + 0;9g1) 


P(vift = 1 |0j¢r, Ber) = 1+ exp (Bi, + 1:9 jer + 02:14 gr) , 


(3) 


where i and j are item and person indices, respectively, as in Equations | and 2. g is the index 
for a cluster (such as a teacher), ¢ is the index for a time point, 0j,; is an individual-level latent 
variable (Level 1), Og is a cluster-level latent variable (Level 2), aj; is an item discrimination 
parameter for O9jor, 2; is an item discrimination parameter for 0g, and B,, is an item intercept 
parameter.” The cluster-level latent variable can be interpreted as the cluster mean of the 
individual-level latent variable in the case of a 1;;= «a; for all items. The meaning of the latent 
variables over levels can differ otherwise. 

Under the MLNLM (with a group invariance assumption), the probability that the examinee 
shows incorrect response category v is modeled as follows: 


Suh et al. 77 


P( viz =0, dijve = 1] jet, Der) =P(yizt = 0/8 je, Ber) P(dinu = 1 [vie = 0, Ojer, Ber) 


exp (Bir + ait Djor + 02191) exp (Civt + ive Djer + NaiveDer) (4) 


+ +044: 9jo1 + 012; M ; 
1 exp(Bir 01 i Diet 02:0 gr) + exp (Cite + rims Ojer + doiti Der) 
k=l 


where v is the index for incorrect response categories, assuming group invariance as in Equation 
3. Ajav and Azay are the item discrimination parameters of a category v for Oj, and Og, respec- 
tively, and ¢;,, 1s an intercept category parameter. Similar to a); and aj, in Equation 3, Aj; and 
Noi Can be identical if there is no cluster bias.* 

Given there are T time points, the individual-level latent variables across time points, 
Dic = [Bje1, Djg2, -- + Ojer)', are assumed to follow a multivariate normal (MN) distribution, 
0j¢~MN(Byrx1)> Xrx7)). Likewise, the cluster-level latent variables across time points, 
Oe = [Oe1, 92, ---, Or’, are assumed to follow an MN distribution, 0.~MN(py7x1), 227x7))- 
Meanwhile, 2irxr) and Yyrxr) are unrestricted variance—covariance matrices. For model 
identification, Purx1) is set to 0 and the variances of Sir r) are set to 1 with an assumption of 
cluster invariance. When the item discrimination parameters are the same across levels (i.e., 
cluster bias does not exist), the variances of 277) can be estimated. The variances of La7x7) 
are also set to 1 when the discrimination parameters are different between Levels | and 2 (i.e., 
cluster bias exists). 


Bayesian Estimation 


To estimate the MLNLM, Bayesian estimation is used because of the high complexity of the 
models. WinBUGS 1.4.3 (Spiegelhalter, Thomas, Best, & Lunn, 2003) is used to implement a 
Markov chain Monte Carlo (MCMC) algorithm. The prior distributions of the model parameters 
are selected to resemble those used in previous studies (e.g., Bolt, Wollack, & Suh, 2012; Patz 
& Junker, 1999) and are presented here in WinBUGS specification. The prior distributions of 
the item parameters are a1;,~ logNormal(0, 2), a&2~ logNormal(0, 2), B;,~ Normal, 0.5), 
Maw~ Normal, 0.5), N2inv~ Normal, 0.5), and C;,,~ Normal(0, 0.5). Assuming the number 
of time points (7) is fixed at 2 in this application, the prior distribution of the individual-level 
latent variables follows a bivariate normal distribution (BN): Oj. = [8je1, 8je2|'~ BN(O, X1x2)); 
where the variances of 2122) are set to 1, while the correlation of Y1x2), PO 01 8je2 
Uniform(—1, 1). The prior distribution of the cluster-level latent variables is 0. = [@¢1, Beo]'~ 
BN(Pyax1) 2@x2)), where the variances of S2x2) are set to 1, when the item discrimination 
parameters are different between Levels | and 2, while the correlation of 22(2x2)> P.18,0 ~ 
Uniform(—1, 1). When the item discrimination parameters are the same across all levels, the 
prior distribution of X22) is as follows: Yoax2)~ Wishart(Q2x2, 2), where Q2x2 is a unit 
matrix of size 2 and degrees of freedom 2 as the rank of 0,. The hyperprior distribution on the 
mean is set at w~N(0, 1) for each time point. The convergence of the MCMC solution is evalu- 
ated with the Gelman and Rubin (1992) method with two chains. 


Bayesian Model Fit 


A relative fit criterion, the deviance information criterion (DIC; Spiegelhalter, Best, Carlin, 
& van der Linde, 2002), is used as a tool for comparing the fit of the models being consid- 
ered. The DIC is calculated as DIC=D(%)+pp, where 9 is a set of parameters, D(9) is the 
posterior mean of the deviance (i.e., a Bayesian measure of fit), and pp is called the effective 
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number of parameters (i.c., a Bayesian measure of complexity). pp is defined as 
Pp=D(%) — D(%), where ® is the posterior estimates of the parameters, and D(¥) is the 
deviance obtained from the posterior estimates of the parameters. The DIC can be easily 
computed by obtaining the two deviance values, D(®) and D(®), as the outcomes of the 
WinBUGS runs. A smaller DIC indicates a better fit, and a difference of less than 5 or 10 
units between the models does not provide sufficient evidence for favoring one model over 
another (Spiegelhalter et al., 2003). 


Empirical Illustration: Effects of Enhanced Anchored Instruction 
Study Design, Samples, and Measures 


This study analyzed data obtained via a host study funded by the Institute of Education Sciences 
(U.S. Department of Education). The goal of the host study was to evaluate the efficacy of 
enhanced anchored instruction (EAI), which was designed by Bottge and his colleagues. For 
more details on the study, the full report can be found in Bottge, Ma, Gassaway, Toland, et al. 
(2014). The design of the study is a pretest—posttest, cluster randomized trial to compare the 
effects of two instructional conditions (EAI vs. business as usual [BAU]) on students’ computa- 
tion and problem-solving skills. 

A total of 24 urban and rural middle schools in the Southeast participated in the study. 
Schools were randomly assigned to EAI and BAU (12 EAI and 12 BAU schools). Each 
school had one math classroom except one school, which had two math classrooms. A total 
of 471 students were in the initial sample. Of these students, 25 students did not respond to 
all items on the pretest or posttest. Consequently, 232 BAU students and 214 EAI students 
remained in the final sample. Twenty-nine percent of the BAU students had math difficulties 
(MD), while 26% of the EAI students had MD. The smallest number of students analyzed 
for each classroom (i.e., teacher) was seven, and the largest was 28. The data showed a two- 
level structure. That is, students were nested in teachers. Based on the chi-square tests of 
equal proportions, the students were comparable between the EAI and BAU groups in terms 
of gender, ethnicity, subsidized lunch, and disability area, and the teachers were also compa- 
rable in gender, ethnicity, and education level (see Bottge, Ma, Gassaway, Toland, et al., 
2014, for the results). 

A researcher-developed test called the Fraction and Computation Test (FCT) was used 
to measure students’ ability to add and subtract factions. The tests that had 20 items were 
administered twice, as a pretest (Time 1) and a posttest (Time 2). The 20 items possess dif- 
ferent item attributes°: Students were asked to add and subtract fractions with /ike denomina- 
tors (e.g., 1/5+2/s) or unlike denominators (e.g., 5/6 — 1/3), simple fractions (e.g., 5/g+!/4) 
or mixed numbers (e.g., 41/\6+!/g+1/), and two stacks (e.g., 3/4/1/2) or three stacks (i.e., 
one more stack in the two-stack example). For the specified attributes for each item, see 
Table 2. 

For 18 of the 20 items, students could earn 0, 1, or 2 points. For the other two items, students 
could get 3 points if they simplified the correct answers (i.e., reduced the fraction to a simple 
term). However, less than 1% of the students in the sample received partial scores on any of the 
items. In this study, partial scores were considered incorrect responses. Interrater agreement was 
99% on the pretest and 97% on the posttest. Raters also identified and coded the primary error 
types the students made for each incorrect item. A total of 11 error codes® were generated and 
labeled as shown in Appendix A as an online supplement. There were no missing item responses 
in the data set. 


Suh et al. 79 


Table |. Model Comparison Results. 


Model DIC 

Longitudinal NLM 30,110 
MLNLM with cluster invariance 30,010 
MLNLM without cluster invariance 29,360 


Note. DIC = deviance information criterion; NLM = nested logit model; MLNLM = multilevel longitudinal nested logit 
model. 


Model Comparisons 


Using the FCT data, models were compared in terms of model fit with the DIC.’ To analyze the 
data with 11 error types (categories) and one correct answer, three models® were fitted. The first 
model was a longitudinal NLM without a multilevel structure (i.e., Equations 3 and 4 without 
the term 0,,). The second model was the MLNLM with an assumption of cluster invariance (i.e., 
Equations 3 and 4 with aj; = a2), and Ajj = Aziz). The third model was the MLNLM without an 
assumption of cluster invariance (i.e., Equations 3 and 4 with aj, 4 ao; and Ajay A Azivr). 
Based on the convergence check results, a conservative burn-in of 6,000 iterations was used, fol- 
lowed by 4,000 postburn-in iterations for all analyses. Table | provides the model comparison 
results. The MLNLM without the cluster invariance assumption appears to be the best-fitting 
model (with a DIC value of 29,360) among the three models.” Therefore, the MLNLM with 
cluster bias was chosen as the base model for subsequent analyses in this study. As explained 
earlier, when the cluster invariance assumption does not hold (i.e., ay, A Oa and Aji A Naive), 
the cluster-level latent variable (6,,) is not the cluster mean of the individual-level latent vari- 
able (0j;) that can be labeled as students’ ability to add and subtract factions in this example. 
Because an instructional intervention EAI was implemented at the cluster (i.e., teacher) level, 
the cluster-level latent variable (0g,) could be interpreted as the EAJI-related student ability, 
which varied across teachers (i.e., EAI effect for EAI students vs. no EAI effect for BAU 
students). 


Sensitivity Analysis 


Before inferences were made about intervention effects with the FCT data, group and time 
invariance assumptions were investigated as a preliminary analysis under the MLNLM with 
cluster bias (the model selected from the model comparison study). The detailed procedures are 
described in Appendix B as an online supplement. As a result of the measurement invariance 
study, no group bias was found at the cluster- and individual-group levels. For the time invar- 
iance study, bias in the intercept parameter for the correct response (8 ;,) was found between the 
two time points. To assess the impact of having the bias in the intercept on the results of the 
latent variables (0s), a sensitivity analysis was conducted. The goal of the analysis was to exam- 
ine the consequences of ignoring the bias by fitting the strong invariance model versus the weak 
time invariance model for the correct response (i.e., bias in the intercept). The sensitivity analy- 
sis was performed in terms of the estimates (i.e., posterior means from WinBUGS outputs) of 
the individual- and cluster-level latent variables (0j¢; and 4) obtained from both models.'” 

By fitting each model, two individual-level latent variables, 8j.; for the pretest (Time 1) and 
Qj.2 for the posttest (Time 2), and two cluster-level latent variables, 6,; for the pretest and 0,9 
for the posttest, were estimated as the outcomes of the model. First, correlations were computed 
between the weak and strong invariance models for each latent variable type. The correlation 
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coefficients were sufficiently high: .99, .99, .99, and .98 for 0j¢1,''8je2, 91, and 6,2, respec- 
tively. Moreover, within each model, difference scores were calculated as 02 — 9j¢1 (i.¢., postt- 
est — pretest) for the individual-level latent variable and Bo — Bet (i.e., posttest — pretest) for 
the cluster-level latent variable. The means and variances for each difference score were very 
similar across the models.'* Finally, two ¢ tests were conducted using the means and variances 
of the difference scores from the two models: one test for the difference between the two mod- 
els using the two means of 6j2 — 6je1 (i-e., an independent samples ¢ test with the null hypoth- 
esis saying that the mean of the difference score from the weak invariance model is the same as 
the mean from the strong invariance model) and another test for the difference between the two 
models using the two means of 6 e2 — Ber. For both tests, the t-test results—t = 0.58 (p = .56, 
df = 890) for the individual level and ¢ = 0.90 (p = .38, df= 48) for the cluster level—showed 
that the means of the difference scores from the weak invariance model were not significantly 
different from the means from the strong model at both levels. 

These results suggested that although bias in the intercept parameter for the correct response 
(B;,) was found between the two time points, fitting the strong invariance model (1.e., ignoring 
the bias) did not yield substantially different results from fitting the weak time invariance model 
for the correct response. In addition, Verhagen and Fox (2013) noted that measurement invar- 
iance should be at least weakly invariant to make comparisons across groups and to measure 
changes across time points. Therefore, it would be reasonable to consider one-group MLNLM 
(the strong invariance model) with cluster bias as the final measurement model for subsequent 
analyses. 


Instructional Effects With the Results of the MLNLM 


In this section, the instructional effects on the probabilities of correct response and each error 
type were examined using the (one-group) MLNLM with cluster bias, as the final model based 
on the model comparisons and the measurement invariance investigations. The substantive 
research questions in this section were as follows: (a) What are the differential effects of BAU 
and EAI on the correct response probability and error patterns? (b) What are the differential 
effects of BAU and EAI within the non-MD and MD groups on the correct response probability 
and error patterns? 

Table 2 provides only the item parameter estimates (posterior means) and a 95% credibility 
interval (CI) for the correct response (1;, Q2;, and B,) due to page limits. The item parameter 
estimates for the error responses (tive oe and C¢;,,,) are shown in Appendix C as an online sup- 
plement. Items that had a /ike denominator were less discriminating and easier than items that 
had an unlike denominator. No particular pattern for the other item attributes was found. The 
means and standard deviations (SDs) of the estimated latent variables (0j¢; and 6,/) at each time 
point were obtained—0jo1 = = — 0.83, 8jg2 =3. 17, 6.1 = — 3.31, 8.2.= — 3.21; SD(6j-1)=6.49, 
SD(8je2) = 6.33, SD(8g1)=0.69, and SD(6.2)=0.90. The difference in the mean of each latent 
variable between the posttest (Time 2) and the pretest (Time 1) was calculated as follows: 
Ojg2 — Ojg1 =3.17 — (— 0.83) =4.00 and O42 — Ogi = — 3.21 — (— 3.31) =0.10, implying there 
were increases in the individual-level latent variable (i.e., students’ ability to add and subtract 
fractions) and in the cluster-level latent variable (i.e., EAI-related students’ ability), on average, 
after the intervention, respectively. 

To answer the research questions, the correct response probability and the probability of 
each error type were calculated using the item and person parameter estimates of the MLNLM. 
Regarding the differential effects of BAU versus EAI, the differences in the probabilities 
between the BAU and EAI groups were calculated across 12 response categories (one correct 
response and 11 error types) for 20 items at each time point. The difference was calculated as 
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Table 2. Correct Response Item Parameter Estimates and 95% Credibility Intervals under the MLNLM 
with Cluster Bias. 


Attributes 
Item Operation Denominator Type Stacks G4; (95% Cl) Qj (95% Cl) B; (95% Cl) 
I Addition like simple 2 0.26 [0.21, 0.30] 0.27 [0.14, 0.43] 2.68 [2.19, 3.22] 
2 Addition like simple 2 0.29 [0.24, 0.34] 0.23 [0.12, 0.37] 2.83 [2.36, 3.34] 
3 Addition unlike simple 2 0.80 [0.67, 0.94] 0.99 [0.64, 1.32] 0.92 [—0.03, 1.94] 
4 Addition unlike simple 2 0.67 [0.57, 0.79] 0.95 [0.66, 1.24] 1.06 [0.25, 1.99] 
5 Addition unlike simple 2 0.71 [0.59, 0.84] — 1.03 [0.72, 1.37] 0.32 [—0.59, 1.22] 
6 Addition unlike simple 2 0.93 [0.77, 1.12] 1.06 [0.70, 1.45] 0.24 [—0.78, 1.41] 
7 Addition unlike mixed 2 0.71 [0.60, 0.83] 0.84 [0.58, 1.17] 0.01 [—0.84, 0.98] 
8 Addition unlike mixed 2 0.59 [0.50, 0.69] 0.68 [0.42, 0.96] —0.41 [—1.24, 0.47] 
9 Addition unlike mixed 2 0.77 [0.65,0.91] 1.05 [0.72, 1.45] 0.00 [—0.86, 0.94] 
10 Addition unlike mixed 2 0.68 [0.57,0.81] 1.14 [0.82, 1.48] 0.44 [—0.37, 1.37] 
II Addition unlike simple 3 0.60 [0.52, 0.71] 0.85 [0.6], 1.11] 0.70 [—0.13, 1.52] 
12 Addition unlike simple 3 0.69 [0.58, 0.82] 1.01 [0.70, 1.34] 0.43 [—0.52, 1.34] 
13 Addition unlike mixed 3 0.67 [0.56, 0.79] 0.82 [0.55, 1.08] —0.73 [—1.50, 0.00] 
14 Addition unlike mixed 3 0.58 [0.48, 0.69] 0.96 [0.67, 1.25] | —0.32 [—1.14, 0.43] 
15 Subtraction _ like simple 2 0.19 [0.15, 0.23] 0.21 [0.11, 0.34] 2.28 [1.90, 2.79] 
16 Subtraction — unlike simple 2 0.64 [0.54, 0.75] 0.85 [0.56, 1.11] 0.05 [—0.74, 0.74] 
17 Subtraction _ like mixed 2 0.16 [0.13,0.19] 0.19 [0.11, 0.30] 1.16 [0.85, 1.58] 
18 Subtraction — unlike mixed 2 0.29 [0.24, 0.35] 0.81 [0.58, 1.08] | —0.08 [—0.77, 0.77] 
19 Subtraction — unlike mixed 2 0.53 [0.45, 0.63] 0.87 [0.58, 1.14] —0.21 [—1.01, 0.57] 
20 Subtraction — unlike mixed 2 0.32 [0.26, 0.39] 0.79 [0.51, 1.09] —1.23 [—2.08, —0.40] 


Note. Cl = credibility interval; MLNLM = multilevel longitudinal nested logit model. 


the probability in BAU minus the probability in EAI. For easier representation, the differences 
in probabilities between BAU and EAI were plotted for each item. Positive values were 
expected for the differences for the error types and negative values for the correct responses on 
the posttest (Time 2) in the presence of instructional effects. The differences were expected to 
be close to zero on the pretest (Time 1), in which instruction had not been introduced. 

After the plots of all items were inspected, the anticipated pattern was observed. An inter- 
esting finding was that the pattern was more obvious in a certain attribute over the other attri- 
butes. For example, Figure la shows plots for /ike items and unlike items. The differences 
were averaged over /ike items (four items) and the unlike items (16 items) separately and are 
displayed in the figure. A common pattern across the two plots is that the differences in the 
probability between BAU and EAI on the pretest were minimal (see the line for Time 1), but 
there were instructional effects between EAI and BAU students in making the errors and 
endorsing the items (see the line for Time 2). In particular, the difference in the probability 
of the Combining (C: Student combines numerators together and denominators together) 
error was the highest among the 11 error types, implying that the EAI students noticeably 
reduced the C error from the pretest compared with the BAU students. Furthermore, the error 
probability for the Other (O: Student makes errors other than those labeled) error was 
improved (i.e., reduced) for the EAI students. For the correct response probability, the prob- 
ability increased after instruction. These patterns were more apparent in the unlike items than 
in the Jike items. For the unlike items, the probability of the C error for the EAI students was 
0.15 lower than that for BAU students, and the EAI students showed a 0.25 higher probability 
of solving the item correctly than the BAU students after instruction. Regarding the addition 
versus subtraction items, the addition items showed a clearer pattern than the subtraction 
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Figure |. (a) Average differences (the probability in BAU minus the probability in EAl) in the response 
probabilities for like items (left) and unlike items (right); (b) average differences (the probability in BAU 
minus the probability in EAI) across all items within non-MD (left) and MD (right). 

Note. BAU = business as usual; EAl = enhanced anchored instruction; MD = math difficulties; C = combining; AA = add 
all; SD = select denominator; AC = adding components; EQ = equivalent fraction error; CE = computation error; 
WO = wrong operation; O = other; L/S = large—small; RN = renaming; NR = no response; CR = correct response. 


items. A previous study showed that fewer C errors by students in the EAI group contributed 
to significant improvements in adding-related items overall (Bottge, Ma, Gassaway, Butler, 
et al., 2014). No difference was observed between the simple and mixed items. Finally, com- 
pared with two-stack items, three-stack items presented results that were more similar to 
those for the unlike items and addition items. 

The differential effect of BAU and EAI on the correct response probability and error prob- 
abilities was examined within the non-MD versus MD groups. The differences were averaged 
across all 20 items and plotted. Figure 1b shows the average differences within each group. 
Similar patterns found in Figure 1a were observed. Interestingly, in the non-MD group, the EAI 
students made fewer Select Denominator (SD: Student selects one of the denominators listed in 
the problem and makes no attempt to make an equivalent fraction) errors than the BAU stu- 
dents. In the MD group, the EAI students made more C errors on the pretest than the BAU stu- 
dents. However, the positive effect of improving performance on C errors was greater for the 
MD group than for the non-MD group. In other words, the absolute difference in the probability 
of C errors between Time 2 and Time | was higher in the MD group than in the non-MD group. 
The instructional effect on the correct response probability was about 0.1 higher in the non-MD 
group than in the MD group. Regarding item attributes, the addition, unlike, and three-stack 
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items showed more evident patterns, as in Figure 1b, compared with the subtraction, Jike, and 
two-stack items, respectively. No difference was found between simple and mixed items. 


Parameter Recovery Study 


To evaluate the parameter recovery of the MLNLM with cluster bias chosen as the final mea- 
surement model, a simulation study was conducted. First, using the parameter estimates 
obtained from the real data analysis as the true parameters, item responses for 446 examinees 
nested within 25 clusters and 20 items (as found in the empirical study) were generated under 
the MLNLM. Second, to investigate the effect of increasing the sample size on the parameter 
recovery, the cluster size was increased from 25 to 50, and the number of examinees within 
each cluster was also doubled, resulting in 892 examinees in total. Thus, the two sample sizes 
were simulated (small vs. large in relative terms) under the MLNLM. For each sample size con- 
dition, 30 replications were simulated. Finally, the effect of ignoring cluster membership on the 
parameter estimation was examined additionally by fitting a single-level NLM instead of the 
MLNLM. This was done only in the large sample condition. In all simulation conditions, the 
true model was the MLNLM. 

As in the real data analysis, chains were simulated to 10,000 iterations with a burn-in of the 
first 6,000 iterations. The prior distributions were identical to those used in the real data estima- 
tion. The convergence of the MCMC solution was evaluated with the Gelman and Rubin 
(1992) method with two chains. Recovery results were summarized in terms of relative bias 
and the root mean square error (RMSE) of the item parameter estimates relative to the true 
parameters. For example, for aj ;, the relative bias and the RMSE were obtained with 


ee (G1; — &4;)/04;|/30 and so (G1; — a1;)° /30, respectively. For the sake of interpre- 


tation, the relative biases and the RMSEs’° for individual error categories were later collapsed 
across the error response categories, as well as items, to create an average value for each cate- 
gory parameter type. The relative biases and the RMSEs for the correct response category para- 
meters were averaged across items. The relative bias implies the accuracy of the parameter 
estimates, and the RMSE combines the parameter bias and precision into an overall measure of 
accuracy. The authors considered that values of relative bias of larger than |0.2| were unaccep- 
table (e.g., Forero & Maydeu-Olivares, 2009). 

Table 3 reports the recovery results for the six item parameters, o1;, @2;, B;, tiv, Aviv, and 
Ci». When the MLNLM was fitted, the relative biases for all item parameters were not larger 
than |0.2|. Based on the RMSE values, the parameter recovery of the MLNLM with the large 
sample condition improved slightly in some parameters (1.€., @1;, Q2;, Atjy, and C;,,) compared 
with the recovery with the small sample condition. In other parameters, the RMSE stayed the 
same (Ap;,) or unexpectedly increased (8;). Given the large number of categories (12), the 
RMSEs for most category parameters appeared comparable with those seen in previous studies 
under the 2PL (e.g., Patz & Junker, 1999), the 2PL-NLM (Bolt et al., 2012; Suh & Bolt, 2010), 
and the NRM (Wollack, Bolt, Cohen, & Lee, 2002). For the intercept parameters, 8, and G,,,, 
relatively high RMSEs were observed; however, the results can be reasonably compared to the 
study by Wollack et al. (2002), in which two sample sizes of 300 and 500 examinees and three 
test lengths (10, 20, and 30 items) were used. Four response categories, including the correct 
response, were considered, and MML and MCMC estimation methods were implemented by 
Wollack et al. (2002). Using a 20-item condition with the MCMC estimation, Wollack et al. 
(2002) reported that the average RMSEs for the slope parameters across item categories under 
the NRM ranged from 0.14 to 0.29, and the average intercept RMSEs ranged from 0.13 to 0.46 
depending on ability levels. 
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Table 3. Item Parameter Recovery Results. 


Model QI Q2; B; Aliv driv Civ 
Relative bias 
MLNLM_SM 0.02 0.12 —0.02 —0.13 —0.11 —0.20 
MLNLM_LG —0.01 —0.06 0.17 —0.18 0.17 —0.06 
NLM_LG 0.02 —1.10 —0.14 —0.43 
RMSE 
MLNLM_SM 0.05 0.13 0.35 0.15 0.24 0.41 
MLNLM_LG 0.03 0.09 0.45 0.12 0.24 0.34 
NLM_LG 0.03 0.99 0.11 1.00 


Note. MLNLM = multilevel longitudinal nested logit model; MLNLM_SM = fitting the MLNLM to the small sample size 
condition; MLNLM_LG = fitting the MLNLM to the large sample size condition; NLM_LG = fitting a single-level NLM 
to the large sample size condition; NLM = nested logit model; RMSE = root mean square error. 


Finally, when the single-level NLM was fitted to the large sample size condition (1.e., when 
cluster membership was ignored), the relative biases for the intercept parameters in the correct 
and incorrect response categories (i.e., B; and ¢;,) were substantially larger than |0.2|. In addi- 
tion, the RMSEs for B; and ¢;, with the single-level NLM were more than twice those with the 
MLNLM, yielding values of 0.99 and 1.00. However, ignoring the multilevel structure did not 
seem to affect the estimation of the slope parameters. 


Discussion and Conclusion 


The MLNLM was presented to analyze nominal responses under a pretest—posttest, cluster ran- 
domized trial design, which is one of the commonest designs used in educational intervention 
studies. Using the real data analysis, this article illustrates the application of the model to a 
model comparison study for selecting the best-fitting measurement model and for examining 
cluster bias (and measurement invariance over time points and between group variables at the 
individual and cluster levels, respectively, in the online supplement). Based on the model com- 
parison and measurement invariance studies, the (one-group) MLNLM with cluster bias was 
chosen as the final model for examining the instructional effects of BAU and EAI on the cor- 
rect response and error response probabilities. A parameter recovery study was conducted by 
mimicking the conditions of the real data set as the simulation design. The effect of increasing 
the sample size was also examined by doubling the sample size in the real data set. The accu- 
racy of the item parameter estimates under the MLNLM was comparable with those found in 
previous parameter recovery studies given the large number of response categories. As the sam- 
ple size increased, the parameter recovery improved slightly in some parameters (04;, 02;, \tiys 
and ¢,,,) but not in others (8; and Xz;,). In this regard, further research on the parameter recovery 
of the MLNLM might be needed in relation to changes in the sample size and the number of 
response categories. Finally, ignoring cluster membership adversely affected the estimation of 
the intercept parameters. 

Future studies with the MLNLM can address various issues. First, the DIC was used to 
examine the model’s fit for the model comparison and measurement invariance checks as used 
in other multilevel IRT studies (e.g., Cho & Bottge, 2015; Fox, 2010). However, the perfor- 
mance of the DIC has not yet been examined under the current application of the MLNLM in 
terms of identifying the correct model and detecting the sources of measurement variance. 
Particularly when the sample size and the number of clusters are small under hierarchical 
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models, the asymptotic properties of the DIC and robustness to nonnormal distributions can be 
impaired. The DIC also tends to choose overfitted models (i.c., more complex models). 
Therefore, practitioners need to be cautious when they choose the DIC as a relative model fit 
criterion. Detailed discussions on the limitations and cautions of using the DIC have been docu- 
mented by Spiegelhalter, Best, Carlin, and van der Linde (2014). In this regard, it would be 
valuable to evaluate the performance of the DIC via a simulation study in terms of model com- 
parisons in the presence of hierarchical data structures, as well as Type I errors and power for a 
measurement invariance study under the MLNLM or its variants. 

Second, only the MLNLM with cluster bias was examined in the recovery study by mimick- 
ing real data testing conditions. Additional investigations under the MLNLM without cluster 
bias and/or more extensive simulation studies under various testing conditions, including differ- 
ent sample sizes and test lengths, should be considered for a future study. In addition, it would 
be worthwhile to examine the consequences of ignoring cluster bias (i.e., assuming cluster 
invariance) in measuring individual- and cluster-level latent variables and changes over time. 

Third, cluster bias was examined at the test level, rather than at the individual-item level. 
However, there might be cluster bias for some, but not all, test items. This is referred to as par- 
tial cluster invariance (Jak et al., 2013). Considering partial cluster invariance after examining 
test-level cluster bias might help researchers explore why cluster bias presents based on item 
and cluster information. With the assumption of cluster bias under the MLNLM, the individual- 
and cluster-level latent variables cannot be interpreted in the same way across all levels. The 
interpretation of latent variables at each level would be of practical interest based on the 
individual-item level analysis when item content information is available. 

Fourth, this study followed previous studies and conventional practices in choosing priors. 
However, the effect of priors on parameter estimation (such as the variances of individual- and 
cluster-level latent variables) was not examined thoroughly although some preliminary analyses 
were done before using the selected priors. Therefore, conducting a sensitivity analysis to 
choose proper priors under various simulation conditions would be valuable. 

Fifth, as illustrated in the real data analysis, the effects of instruction (BAU vs. EAI) and 
group memberships (non-MD and MD groups) were examined using the parameter estimates 
from the MLNLM. However, the MLNLM can be extended by incorporating the instruction 
and group variables in the model as person covariates to explain the individual- and cluster- 
level latent variables as in explanatory IRT models (De Boeck & Wilson, 2004), which will be 
a more complex model than the MLNLM but can provide direct estimates of the instructional 
and/or group effects on the latent variables. 

Finally, there exist some concerns about the use of hierarchical models. Compared with the 
HLM, the MLNLM may perform adequately when cluster size, the number of clusters, and 
intraclass correlations are large because there are more parameters in the MLNLM than in the 
HLM. Therefore, further research on these design factors regarding hierarchical structures for 
the MLNLM is still needed. 

The main goal of this study was to present and illustrate the MLNLM to analyze the correct 
response and error types as an IRT model for detecting intervention effects in multilevel longi- 
tudinal intervention data. A major advantage of the MLNLM is its ability to present the results 
based on the correct response and error probability comparisons, which together can provide 
practitioners with a detailed picture of the strengths and weaknesses of the intervention. 
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Notes 


1. Multilevel item response theory (IRT) models assuming the same item discriminations over levels 
can be considered variance component factor models (Rabe-Hesketh, Skrondal, & Pickles, 2004). 

2. For the item parameters, group invariance is assumed in terms of notation. If the group invariance 
assumption is violated, then the g index needs to be included as a subscript in each item parameter, 
implying the parameters are freely estimated across the groups being considered. In addition, a); 
and a; can be identical if there is no cluster bias—that is, exp (Bj, + 01i9jor + O29.) is reduced to 
exp[B;, + a1i(Bjer + Og)]. Finally, oj), O2i, and B;, in Equation 3 can be reduced to aj;, a2;, and B;, 
respectively, with the assumption of time invariance. 

3. rw, Naw, and C;,, can be reduced to Ari, Aviv, and C,,, respectively, with the assumption of time 
invariance. The same constraints pertaining to Equation 2 are applied—namely, me Cay = 9, 
4 Mar = 0, and eh Noiv = 0 (1.e., sum-to-zero constraints for each time point ft) are imposed. 

4. When the individual-level latent variable and cluster-level latent variable are orthogonal and one 
latent variable is considered at each level for each time point, additional constraints on the individ- 
ual- and cluster-level item discriminations are not necessary (Longford & Muthén, 1992, pp. 583- 
584). 

5. As a preliminary analysis, exploratory factor analyses were conducted at each time point in Mplus 
version 7.11 (L. K. Muthén & Muthén, 1998-2015). Although the 20 items can be grouped based on 
the item features (e.g., like vs. unlike items), the one-factor solution provided a good fit to the data. 

6. Error type coding assumed that all items have the same 11 error types. 

7. Before the deviance information criterion (DIC) was applied for model comparisons, the posterior 
distributions of model parameters were examined. The distributions were symmetrical; that is, the 
mean and the median showed similar values. 

8. For this model comparison analysis, time invariance was presumed across all the models. As shown 
in the next section, a preliminary analysis was conducted in terms of measurement invariance checks 
over the time points and the individual- and cluster-level groups. 

9. The multilevel longitudinal nested logit model (MLNLM) without the cluster invariance assumption 
(i.e., MLNLM with cluster bias) can be fitted for the correct response category only. That is, a1; and 
Qo; are freely estimated, while \j;,, and A»2;,, are assumed to be identical. The opposite case can be 
modeled, too. These two models were fitted, and their DIC values were greater than the DIC value 
(29,360) of the best-fitting model (i.e., the MLNLM with cluster bias for the correct response and 
incorrect response categories). 

10. Note that the only difference between the two models is that two different intercept parameters (B,,; 
one for each time point) can be estimated across time points in the weak invariance model, whereas 
the same intercept parameter (8,;) between the two time points is obtained in the strong invariance 
model. 

11. Hereafter, the hat over each parameter represents the estimate of the true parameter. 

12. The means and variances of the individual-level difference scores were 3.57 and 34.74 for the weak 
model, and 3.34 and 37.77 for the strong model, respectively, and the means and variances of the 
cluster-level difference scores were 0.27 and 0.46 for the weak model and 0.24 and 0.49 for the strong 
model, respectively. 

13. Note that the biases for the error categories are naturally forced to zero because of the sum-to-zero 
constraints applied in Equation 4. In addition, the biases and the root mean square errors (RMSEs) 
for the correct response category parameters showed similar patterns. Therefore, we reported the rela- 
tive bias and the RMSE only. 
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